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STATISTICAL METHODS, SOME OLD, SOME NEW 
A TUTORIAL SURVEY 

Donald P. Gaver 

This report is a revised and expanded version of material 
on statistics presented at the Summer School on Remote Sensing in 
Meteorology, Oceanography, and Hydrology , held at the University 
of Dundee, Scotland during the month of September 1980; its direc- 
tors were Professor A. P. Cracknell and Dr. G. Ostrem. 

The attendees, both students and faculty, were from many 
countries and professional and educational backgrounds. There 
were, for example, physicists, electrical engineers, atmospheric 
and geophysical scientists, physical geographers, photographers — 
and two statisticians: Dr. Ed Wegman of ONR, Washington, and Dr. 

Donald Gaver of the Naval Postgraduate School, Monterey, California. 
The present report was largely assembled by D.G., with inputs from 
E.W. A shortened version was placed on transparencies and formed 
the basis for two one-hour-plus lectures. 

Those who attended these lectures appeared to be quite posi- 
tive in their response. It will be noted that no attempt was made 
to present much formal mathematical material, perhaps to the lis- 
teners' surprise. In view of the background of most participants, 
this seemed appropriate. An attempt was made to lead the listeners 
into some of the approaches and concerns of the data analyst; 
particularly one who might be working with "environmental" data. 

In the course of talking to participants during the school, and 
particularly following these lectures, I discovered that there 



was a good deal of interest in statistical methodology for use 
in the atmospheric and geophysical sciences. I hope to follow 
up on some of the contacts made, and perhaps engage in collabora- 
tive activities. There seem to be many who are interested, and 
much to be done in applying statistical thinking and methodology 
to the various, generally environmental, geophysical, or atmos- 
pheric areas of interest represented at this exciting and stimu- 
lating summer school. 



STATISTICAL METHODS, SOME OLD, SOME NEW 



A TUTORIAL SURVEY 
Donald P. Gaver 



1 . Introduction 

The purpose of the two lectures with the above title is 
to introduce to some, and review for others, selected topics in 
statistical methods. It is hoped that these topics will be use- 
ful to workers in remote sensing. Few details will be given, 
but references will be provided so that those interested can go 
further. 

Statistics can be tentatively defined as the science, 
technology, and art of drawing quantitative inferences from data. 
The subject is always in a process of development, urged by the 
needs of those who have, or plan to obtain, data, and further 
stimulated by conceptual developments and the widening possibili- 
ties of using digital computers for storing, analyzing, displaying 
and finally understanding the meaning of that data. 

1.1. Phases of a Statistical Inquiry 

It seems useful to distinguish several phases of a 
statistical inquiry 

o Objectives of the inquiry, 
o Data acquisition/ 
o Data exploration / 
o Model choice or construction , 
o Confirmatory or validational analysis / 
o Communication of results . 



Our term data acquisition subsumes the choice of what data to 
take in a particular problem setting, experimental design for 
economy and avoidance of bias, etc. This topic is crucially 
important, but for this audience requires a substantive knowledge 
of remote sensing issues that we do not yet possess. It will not 
be discussed here. By data exploration is meant the activity of 
examining data, both graphically and through numerical summaries, 
for the purpose of revealing properties of the data itself, and, 
with luck, of the process giving rise to that data. John Tukey , 
e.g. ( 1977 ) has shown us that there is much to value and learn 
about exploration of data before, or even without, the use of the 
probability theory structure that has traditionally seemed so 
necessary for formal statistics. We will discuss several of these 
simple exploratory approaches in the first lecture. Model choice 
or construction is the phase of inquiry that brings to bear sub- 
ject matter knowledge to "explain" data behavior. Here probability 
theory may well enter to represent measurement errors, or fluctua- 
tions in natural phenomena, such as the heights of wind-driven 
water waves. The confirmatory analysis phase is concerned with 
developing quantitative expressions for uncertainties inherent 
in simple summary statements, and in more complex characteriza- 
tions and predictions made from formal models. This is the formal 
inference phase to which mathematical (especially probability- 
theoretic) methods continue to contribute. 
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2 . 



Simple Graphics and Summaries of Data 

2.1. Stem and Leaf (Successor to Histogram) ; Transformation 



Suppose we have a batch of measurements on some physical 
quantity. We wish to display these numbers in f requency-of- 
occurrence style, losing as little detail as possible. 

Here is such a batch: 



0.92, 1.14, 2.00, 2.66, 6.52, 4.95, 2.76, 2.13, 1.14, 9.72 
11.24, 0.25,. 3.31, 12.66, 0.46, 2.77, 0.35, 66.67, 0.59, 1.84 



To get a stem-and-leaf display draw a vertical line ( stem ) and 
decide on reasonable class intervals: first try intervals 

0 -»■ 10- , 10 -*■ 20- , 20 -*■ 30- , etc . Now put the integer part 

of the number by convention (rounded to the nearest integer) to 
the right of the stem, e.g. 0.92 becomes 1, 1.14 becomes 1, 

...6.52 becomes 7, etc . The stem and leaf appears as follows: 



0 

1 

2 

3 

4 

5 

6 
7 



1123532103030127 

013 



7 



At the end there is a leaf of entries attached to the stem at 0 , 
1, and 7. Imagine that each leaf entry (single integer) occupies 
one unit of area, and hence count frequencies are proportional 
to areas, exactly as in the conventional histogram. Note that 
greater detail concerning number identify is preserved by the 
stem-leaf than is managed by the conventional histogram. 
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Stem-leaf displays are usually constructed by making a 
convenient but informal choice of leaf interval (here it was 10 , 
but for more detail it could be 5, and for more smoothness, hence 
less detail, 20). A more formal approach, see Scott (1979), is 
that of picking the bin size, h n , so as to minimize an integrated 
mean-squared error of estimate of the true density by the histo- 
gram. In summary, if the data is approximately Gaussian then 
the prescription turns out to be 

h = 

n 

where h is bin size for a batch of n, and where s is the 
n 

sample standard deviation. Observe that a few extreme outliers 
in an otherwise Gaussian-appearing batch will unjustifiably expand 
s, and over-coarsen the histogram. A robust estimate of a, 
such as 

~ _ (Upper Quartile) - (Lower Quartile) 

° 1.35 

might then be recommended instead of s. 

A second, more basic, feature of the raw data and the 
display is the crowding in the first class interval: numbers 

like 0** are all jammed together, and there is evidence of 
systematic (right) skewness. In such cases transformation of 
the basic numbers before plotting is often useful, and in this 
particular case a first attempt might be with logarithms (base 
e = 2.7182..., but the particular base doesn't matter). Here 
are the logged numbers: 
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-0.083, 0.13, 0.69, 0.98, 1.87, 1.60, 1.02, 0.76, 0.13, 2.27 
2.42, -1.39, 1.20, 2.54, -0.78, 1.02, -1.05, 4.20, -0.53, 0.61 



Here we might try a scale of 0.5, so, rounding to the nearest 
tenth of an integer: 

Ranges Stem-Leaf 



-1.5 


-2 


-1.0 


-1.5 


-0.5 


-1.0 


- 0 . 


-0.5 


0 . 


0.5 


0.5 


1.0 


1.0 


1.5 


1.5 


2.0 


2.0 


2.5 


2.5 


3.0 


3.0 


3.5 


3.5 


4.0 


4.0 


4.5 



-1 

-1 

-0 

-0 

0 

0 

1 

1 

2 

2 

3 

3 

4 



41 

85 

1 

11 

786 

0020 

96 

34 

5 
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Notice that the previous crowding and skewness has nearly 
disappeared, and we see revealed the vestigial appearance of 
two separate "humps," as well as the originally apparent outlier 
(£n 66.67 ~ 4.2). We are led to investigate the possibility that 
the data derive from two separate sources, with one exotic (mis?) 
measurement thrown in for good luck. In this form the data urges 
more upon us than it did originally. 

2.2. Number Summaries 

Traditionally it has been customary to summarize certain 
features of batches of measurements by moments : the (arithmetic) 

mean gives the "location" of the data set, the standard deviation 
summarizes "spread" (or "scale," or "width"), the third central 
moment measures "skewness," and so on. However, certain apparently 
more primitive measures are useful and have virtues. 



5 



Work with the observations after ordering them into 
x (l) < X ( 2 ) < X (3) < * * * < X (n) * 



(m) 



a) Median, M. 

Compute the median index m = (l+n)/2. 
is the middle (single) number, and if n 



the two middle numbers. Thus the median is 



If n is odd 
even then average 



M 



j X (m) 

* 7 <x ([ml) +x (Im]+l)> 



if n is odd , 



if n is even . 



where [m] is the integer part of m. Notice that quite radical 
changes in extremes, e.g. x (i) or x (n)' effects M not at 
all; in fact changes from x ( n ) to 1000 x ( n )» creating a very 
isolated single observation, leaves M alone, so M is resistant 
(un-influenced) by such changes, or occurrence of outliers. On 
the other hand, the familiar mean responds dramatically and unde- 
sirably and is far from resistant. Thus the median is a useful 
candidate for "placing" or "locating" a rather concentrated batch 
of points when several exotic, separated, extremes are present; 
the mean is apt to strike a meaningless compromise. Of course, 
the median does not well- summarize a batch having two or more 
distinct but nearly equal-sized humps. But certainly it does no 
worse than the mean. Later we discuss some better measures, and 
also some for spread (alternatives to standard deviation) . 
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(b) Quartiles: Lower = Q, Upper = Q 

Roughly one-quarter of the observations fall below Q 
(above Q) . Define 



Q is defined analogously in terms of n-q+1. In other words 
Q(Q) is the median of the lower (upper) half of the ordered 
batch . 

(c) Eighths: Lower = E, Upper = E 

About one-eighth of the observations fall below E (above 
E) . Let 



q = ^-(1 + [m] ) . 



Then 



l 




Q = 




is even . 



e = |(1 + [q]) . 



Then 



I 



if [q] is odd , 



E 




if [q] is even 



and E is defined analogously in terms of n-e + 1. 
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(d) Extremes 

There are simply 



Ext = x ^ , 



Ext = 



(n) 



A seven-number summary of the data 




is as follows: 



MQ - M 

Sq q-q 

ME - M 

s p> = ”= 

E E - E 



Ext 



MExt - M 
Ext-Ext 



The quartile (eighth, extreme) means MQ (ME, 'MExt) can be quickly 
compared to M to detect systematic asymmetry or skewness. 
Dimensionless measures of skewness are also given by the quanti- 



ties 






Q‘ 



’E' 



"Ext' 



Example: x (i) = 



X( 2) 3, 



(3) 



= 5, 



(4) 



= 7 



X( 5) = 9, 



( 6 ) 



= 111 . 



Then m = -J(l+6) = 3.5, 



SO 



M = y(5+7) = 6 . 

Note that the mean x = 22.67. This neither-f ish-nor-f owl 
number has responded heavily to the isolated value 111. Next 
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i 



q = |(1 + [3.5]) = 2 



so 



Q = 3, and Q = 9 . 



Now 



e = |(1 + [2]) = 1.5 , 



SO 



E = 1+3) = 2, and 



E = i(lll+9) = 60 . 



The number summary then is 




0.43 



0.45 



The numbers suggest positive skewness, but closer examination 
reveals that this is caused by the influence of one point alone. 

Note that if. the data is a sample from the Normal distri- 
bution with standard deviation a, then convenient approximate 
estimates of a are 



and 



(Q- Q) 



1 

1.35 



A 




(E - E) 



1 

2. 30 



/A, 
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and thus 



E - E 



1.70 



■ rvy 
_ r\j 

Q - Q 

if data are approximately Normal; this is a handy check. An 
up-to-date test for precise Gaussian behavior is that of Wilk- 
Shapiro (1965) . It involves a suitable linear combination of 
ordered observations as a test statistic. A cautionary note: 
critical values of the test statistic (rejection region) are 
obtained on the basis of independence of batch data values, 
often not an acceptable assumption in practice. 

2.3. Box Plots 

The box plot is a picture of the five-number (omit Eighths) 
summary. Simply draw a rectangular box with ends at Q and Q, 
and in which M is marked. In addition, connect up the extremes. 
As was done with stem-leaf we arrange the "box" vertically. 
Embellishments are sometimes useful: (i) it has been recommended 

that box width be proportional to /n to indicate effect of sample 
size when comparing different batches, (ii) warning points at 
Q + 1.45(Q-Q) and Q-1.45(Q-Q), something like "fences" of 
Tukey: if the data are Gaussian there is only about one percent 

of the data "outside" these values. Box plots are especially good 
for giving a quick appreciation of comparative distributional 
behavior for different batches. 
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Simple Box Plot 




11 



2.4. 



Rooted Histograms ( " Roo to grains " ) 



Hark back to the histogram over bins of width h, with 

j_ i. 

n. observations in the j — bin. The standard error of (n./n) , 
D 3 

the frequency estimate of 

th 

Pj = Probability of an observation in the j— bin 



f x i+l Hl+x -i ) 

= | -* - 1 f (x) dx “f (Xj+h/2) h 



x . 
D 



is at least under random sampling. 



• /n} 1//2 ~ { (n ./n) (1-n ./n) n 



{p j (i - p 3 3' 



which suggests that the magnitude of the sampling fluctuations 
in the high-count (large p^) bins may be much larger than those 
in the low-count bins; on the other hand, the relative fluctua- 
tion magnitudes are greater in the low-count bins. The square- 
root transformation, J nT/n , tends to stabilize (equalize) this 
variation: y~ri~7n tends to estimate ^pT with standard error 
of about 0.5. This suggests several possibilities for useful 
graphical analysis: 

(1) Smoothing a raw histogram by smoothing ynT/n - values and 
re-squaring the results, as possibly in 




so the estimate of p . becomes 
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Other approaches, such as running medians, (see Tukey 
(1977), p. 543 ff . ) are likely to be effective. The 
attempt is to reduce sampling fluctuations without resort- 
ing immediately to the "ultimate smooth:" a simple model. 

(2) Assessing model fit by a hanging rootogram . Having a model 
distribution in mind, perhaps suggested by theory, we wish 
to graphically present the evidence for and against it 
using a histogram of data. To this end, plot 



this is equivalent to looking at a plot of the square root 
of the density, with roots of the histogram values hanging 
from it. If the model is in basic agreement with the data 



lie within unit distance of zero. These differences "should" 
also be nearly pattern-free as order goes. Departures 
vividly show exactly where the model and the data disagree. 
Thus the procedure has a focus, and a specific diagnostic 
slant. It can clearly be used for histograms in more than 
one dimension. 

The above idea, also due to Tukey can be extended to supply 



a formal test of goodness of fit analogous to the classical Chi- 
squared test. At the moment the concern is with graphically 
exposing data features and with indications of model-data 
discrepancies . 




. h 

VS X. + 2 ; 



then about 95% of the values of the differences Aj should 
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3 . More Summary Measures 

In Section 2 we worked with the median, M, as a summary 
measure of location, and with the midspread Q-Q as a summary 
measure of distributional spread. Of course the classical measures 

/T — 11 ~ 2 

would be the mean, x, and standard deviation, s(= £ (x^-x) ) . 

i=l 

The latter are suspect because of their sensitivity to occurrence 
of a few wild values--perhaps recording errors. Here are some 
currently well-regarded alternatives. 

3.1. Winsorized mean; confidence limits 

Suppose a set of nearly symmetrically distributed 
observations are in hand. If a small number of outlying observa- 
tions are feared one can diminish their impact by symmetrically 
Winsorizing to level g (g < 0.20n, n being number in the batch) : 

(a) Order the batch 



(1) <X (2) 



< x 



(g) 



< X 



(g+D 



< X 



< X 



(n-g) (n-g+1) 



< x 



(n) 



(b) Winsorize, level g: 

y (l) = y (2) •** y (g) = x (g+1) <y (g+l) = x (g+l) <y (g+2) < “* 

Y (n-g) = x (n-g) = y (n-g+1) = y (n-g+2) = = y (n) * 

In other words, define the g smallest in the Winsorized batch 

th 

(y's) to equal the g — smallest raw data point, x (g)* Do 
the same symmetrically at the upper end. The result will be tied 
values (g at bottom, g at top) . 
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(c) Average, to get a g-Winsorized mean: 



i 11 

x = — £ y • 

Wg n J i 

Notice that if n is odd and you Winsorize to the extent 
ri - 1 

g = — the result is precisely the median. The Winsorized 
mean can be viewed as a broadened median, with a median's vir- 
tues (oblivious to outliers) , but that uses more of the sample 
information. Winsorizing is named for C. P. Winsor, an insight- 
ful applied statistician. 

(d) Confidence limits using the Winsorized mean. 

Confidence limits attempt to express the uncer- 
tainty inherent in a particular estimate of a fixed quantity 
(such as mean radiation, mean visibility, etc.). If one is 
estimating the mean, y, of a symmetric distribution (possibly 
after transformation) , and if the observations are independent, 
then it is appropriate to use the classical Student's t (or a 
Gaussian approximation) : with confidence (not probability) of 

(1-a) • 100%, 

x + t /0 (n-1) — s y 5 x + t. /9 (n-1) — 

a/2 /n Z - a/2 /iT 

where t D (n) is the 3 • 100% point of a Student's t with n 

p 

"degrees of freedom;" x and s are the sample mean and standard 
deviation, respectively. Such intervals are reasonably satisfac- 
tory even if the data are not quite Gaussian, but if there are 
some extreme outliers, s blows up and the intervals are much 
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too wide. One should also compute the Winsorized limits : after 



computing 2 ^^, 



find 



2 

s Wg 



1 

n-1 



l 



i=l 






2 



and then find the confidence limits 



Wg 



+ t a/2 (n ' 2g " 1) 



Wg 

/n 



2 y < 



x.. + 
Wg 



(n-2g-l) fc l-a/2 (n 2g 1) 



Wg 

/n 



These may be decisively narrower; if so, look for outliers. Pick 
a succession of g-values, but not to exceed 0.20n. 

For more details, see Dixon and Tukey (1968) and Dixon and 
Yuen (1973). 

3.2. Biweights 

The Winsorizing procedure has to a degree been supplanted 
by a procedure called biweights . This involves an iterative least- 
squares calculation with weights on individual observations that 
decrease as the degree apparent non-representativeness (exoticizm) 
of a data point increases. Let .,x n be data points; 

then here is a recipe for fitting a resistant center via biweights: 
referring to the k + 1— iteration in terms of the k-^-. 



x (k+1) 



n 



I w, (k) 
i=l 1 



X . 
1 



n 



l w. OO 

i=l 1 



where the weight of observation i is 
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n 



1 - 



f x i - x (k) 
cs7 



if 



x. - x (k) 
1 

cs. 



< 1 



w i (k) = 



otherwise . 



and s k is a spread-measure, perhaps 



s, = median { | x. - x (k) | } 
X X 



the so-called MAD or median absolute deviation, or alternatively 




the midspread of the residuals at the k — stage. These estimates 
are essentially equivalent. Here c is a tuning parameter; a 
value of c = 6 or 9 has been recommended for general use, while 
c = oo yields up the OLS estimates. 
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4. 



Relations Between Variables, Models, and Model Assessment 



The discovery and establishment of quantitative relation- 
ships between measurable, or classifiable, variables is a primary 
scientific concern. Once a relationship is uncovered it is soon 
likely to be thought to be potentially useful for some human 
purpose, and attempts will be made to apply it. Many examples 
exist in the remote sensing area. Of course the quality, or 
strength, of the relationship must be such as to make its appli- 
cation feasible. Statistical methods and thinking are frequently 
used, sometimes very informally, to help in finding a relationship 
and to expressing it in succinct, often, formally mathematical 
terms (as a mathematical model ), to characterizing its deficiencies 
or biases, and to expressing the uncertainties inherent in its 
use. Here is a brief review of some of these methods. 

4.1. Graphical Plotting for Relationship Exploration 

Many relationships between variables are initially 
guessed from scientific theory; or at least from subject-matter 
knowledge, intuition, or low cunning. When relevant data becomes 
available the obvious first step is to plot it graphically, if 
possible. This is easy if the relationship of interest is between 
two variables: x, an explanatory variable , or condition (some- 

times called factor , or lately carrier ) and y = f (x) , a response . 
Suppose one has observed pairs of these variable values: 

Yi , x i ;i=l , 2 , . . . ,n. Then plots of f(x^) vs x^ and y^ vs x^ 
on the same graph (e.g. rectangular coordinates) are sometimes 
presented for comparison. Since the range of f(x)--and y — may 
be considerable, such plots often obscure real systematic differ- 
ences, it is often wise to plot and study residuals 
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r . = y . - f (x. ) 

1 J i 1 

either vs^ x^ , or vs_ = f(x^); plotting residuals vs_ 

estimated values y^ is one of the obvious ways of evaluating a 
relationship depending on several explanatory variables. Examina- 
tion of a plot often reveals some systematic departure from 
linearity, such as a definite curvature may become evident; 
frequently this may be essentially removed by transformation 
(alternatively called re - expression ) . This is a good idea, for 
comparison of plots to straight lines (looking at residuals from 
straight lines) seems well adapted to human perceptions. The 
powers y y^(p>l, or p<l) or y -* Jin y (zero power) are often 
recommended for transformation to the nearly linear. Of course 
theory (e.g. the laws of physics, such involving squares, etc.) 
may point the way to an approximate linear plot. Then systematic 
deviations may be identified and interpreted on their own merits. 
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4.2. Relationship Fitting 

Although the mathematical form of the relationship 
between response y and explanatory variable x may suggest itself 
from theory, certain constants (parameters) nearly always require 
numerical determination. This means that a mathematical model 
exists, and must be fit to the data. Afterwards, one can look 
at residuals for indications of mis-fit, or apply the model to 
make predictions and check out its errors. One can sometimes 
segment the data, utilize a part to fit the model, and then use 
the remainder to examine the performance of the fitted model. 

This latter procedure is called cross-validation. Again residual 
plots are desirable diagnostics. 

The actual fitting (constant, or parameter, estimation) 
problem can be carried out in various ways. We illustrate in 
terms of a postulated simple linear relation between response, y, 
and explanatory variable, x, i.e. 

y = a + bx , 

with a and b to be determined. Here are some options, 
o Simple least squares 

o Median, two-points 

o Biweights 

Given a set of (y^x^) data, least squares proceeds by minimizing 
the sum of squares 

n ? 

I (y.-a-bx, ) 
i=l 1 1 
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by choice of a and b. Linear equations result for estimated 



a, namely a^g and estimated b, namely Man Y computer 

programs exist for doing this problem, and also for doing the 
multiple regression problem: fit 



Vj-J' 1 ' 2 n e 0 + e i x lj + B 2 x 2 j + • • • + Vpj • 



For better or worse, any set of data can be conveniently fitted 

by such a linear function, i.e. the parameters B-,3-|,...,6 

u x p 

can essentially always be determined numerically, whether or not 
the postulated relationship makes sense. Furthermore, the esti- 



mates, 3., turn out to be linear functions of the y. values. 

1 3 

This suggests that 3^ values also respond linearly to changes — 
perhaps even unfortunate or unsuspected discrepant values or 
errors — in the y^ values, and so a few funny values can give 
a warped picture : 




The fitted line may not follow the main data cluster, but instead 
fasten itself to a single exotic point. For some discussion of 
this and remedies see Mosteller and Tukey (1977) and also Belsley, 
Kuh, and Welsch (1980) . 

The second method mentioned ("median, two-points") 
avoids some of the above difficulties. Prescription: order the 

explanatory variables from smallest to largest: X Q) <x (2) *** 
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x (n/3) < X (n/3+l) X (2n/3 )< 



x, x . Call the set of 
(n) 



values x {1) ,x (2) x (n/3) the low group, and * (2n/3) ' x (2n/3+1) 
... the high group. Let x^ be the median of the low 

group, and x h be the median of the high group. To each member 
of the low group there is a y; let y be the median of the 
corresponding low y's, and y^ be the median of the high y's. 
Now fit: 



MED 



y h' y & 
x h " x £ 



a MED Y h b MED X h 



This procedure is a modification of an ancient procedure that 
utilizes means instead of medians. Since medians are more resistant 
to outliers than means, this procedure is a move in the right direction. 

The third option ("biweights") is an iteratively weighted 
least squares approach. It has been programmed for many computers, 
including evan a hand-held TI-59. This procedure develops a weight 
for each observation; the weight diminishes as the observation 
becomes apparently discrepant, as in the earlier discussion. 

Details are omitted here. 

Example. Sea-surface wind speed and white-cap coverage. 

It has been proposed by Monahan (1971) that magnitudes 
of sea-surface winds and white caps are strongly related. Pre- 
sumably white -cap coverage can be assessed remotely, and wind 
speeds deduced there from. A set of data offered by Toba and 
Chaen (1971) has been analysed using the above methods. An 
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appealing functional form is the power law 



3 

y = ax ; 

it has been argued on theoretical grounds; 3 is supposed to be in 

the neighborhood of 3. A computer scatter plot shows that this is 

1/3 

plausible. An initial cube-root transformation (y / vs_ x, 

3 

equivalent to the model y = ax ) seems to straighten the plot 
reasonably well, but note that a few zero values occur; other 
outliers are less obvious. 

It is natural to try to fit the log-transformed version 
of the power law, for this is now linear: 

Jin y = Jin a + 3 Jin x 

Here are some results (OLS means Ordinary Least Square, Bi(i) 
means the biweight result after i iterations, means robust 

scale after i iterations) ; note that it has been necessary to 
fit Jin (y + start) , start = 0.001 here, in order to avoid the 
embarrassment of logging exact zeros. The "start" value can be 
chosen wisely; no attempt to do so is exhibited here, however. 

PARAMETER FITS FOR LOG-LINEAR MODEL OF WHITE CAPS vs WIND 

Estimate 



Method 


Jin a 


3 


s . 
1 


OLS = B i (1) 


-9.65 


3.51 


1.44 


OLS (zeros out) 


-10.31 


4.22 


— 


B i (2) 


-10.73 


4.24 


1.26 


B i (4) 


-11. 32 


4.68 


1.05 


B i (6) 


-11.39 


4.25 


1.03 
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Apparently the biweight calculation yields quite different results 
from OLS, even after visual culling of apparent zeros has been 
conducted. It would be of interest to utilize the techniques of 
Cook (1977) and Belsley, Kuh , and Welsch (1980) to further identify 
influential observations. Of course the present setup is simple 
enough so that visual examination will reveal most of what is 
present. This would not necessarily be so if the explanatory 
variable were a vector. 

Some computer plots are included to show that the biweight 
procedure provides fits that tend to dramatically reveal the 
presence of outliers. In many cases the presence of such outliers 
represents opportunity for new insight and discovery, as is pointed 
out by Tukey (1977). Outliers should never be immediately "thrown 
away," but rather are candidates for special attention. See 
Kruskal (1960) for some wise discussion of this issue. 

4.3. Fitting Probabilities and the Like 

It is often appealing to explore the relationship between 
the probability of some environmental (or other) event and some 
reasonable explanatory variables. This amounts to estimating 
conditional probabilities. Often the events in question are of 
the form "rain," "fog," "visibility in the range 0-5 kilometers," 
etc . Candidate explanatory variables may be the product of a 
remote sensing system, or a numerical weather forecasting scheme 
("model output statistics") , or a combination thereof. Some 
attempts to use persistence ("it rained in Monterey on January 
14, so the probability of rain on January 15 is *+0.2, whereas 
without rain it is *") are also appealing. In the latter form 
the simple idea of Markov chains has been employed. 
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Ordinary linear regression has been used to predict 
event probabilities, e.g. the REEP scheme of R. G. Miller (1964). 
This has the aesthetic difficulty that probabilities are between 
zero and 1, which ordinary regression doesn't recognize. Two 
ways of dealing with probability regression problems are as 
follows : 

o logistic regression 

o conditional probabilities. 

In logistic regression one utilizes the simple model 

a+bx 

P{event E | explanatory variable X=x} = p(x) = — 

1 , 3.t n x 

+ e 



where a and b are constants to be determined. Although this 

model has been written in terms of one explanatory variable, 

multiple regression options are available. One can contemplate 

fitting the logistic model in several ways: 

(i) By maximum likelihood . Suppose the explanatory 

th 

variable takes on value x. on the i — occasion, and a success 

i 

(event, e.g. rain) occurs, i = 1,2,..., n happenings occur. 

Then the likelihood of the a,b using the data (6^, x^ , 

•to \ , c f 1 if event occurs . ^ 

i=l, 2,... n) , where 6 . = -J /is seen to be 

l 0 otherwise 



L (a ,b ;x) = 



n 

n 

i=l\ 1 + e 



a+bx-; 



6 . 
l , 



, 1 - 6 . 

l 



a+bxn 



t ^ a+bx.; 
,1 + e i 



or 



n n a+bx . 

£(a,b;x) = £n L = £ 6 . (a+bx.) - £ &n[l+e 1 ] 

i=l 1 1 i=l 
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Differentiation, then solution for a, b yields maximum likelihood 
parameter estimates. See Cox (1970), and, recently, McCullagh (1980). 
Pregibon (1980) has generated robust procedures. 

(ii) By grouping . If the data set is large one can 

order the x's, split the ordered values into an equal number 

of groups, g, calculate the frequency of successes in each 
n . 

group Pj = where n^ is the number of successes in group 

j and n is the total number of observations. Now let x^ be 
a representative explanatory variable value for group j ; the 
median may be appropriate. Note that 

Yj = ^Pj/(l“Pj)j vs a + b x. 

should be nearly linear if the logistic relationship holds (if 
not, try a transformation) . In any case it is linear in the 
parameters, so a fit can be readily made and diagnostically 
viewed. One can even fit the above by weighted least squares, 
or perhaps biweights. Such procedures are under investigation 
at the Naval Postgraduate School by the author and Dennis Mar; 
they seem promising. See Cox (1970) , especially Chap. 3, for 
some good discussion. The procedure can be made multivariate 
(x a vector) , provided there is a good deal of data. 

Notice that the above method handles only dichotomous 
situations ("rain, no rain," "visibility in category (*) , visi- 
bility in catagory (not (*))"). A multiclass version of the 
above has been devised by P. Bloomfield, J. Lehoczky and the 
author (possibly also by others) and is under development. 
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Another approach to the multiclass problem is by conditional 



probabilities. Suppose is the event that an observation 

(e.g. of visibility state) lies in class j ( j = 1 ,2 , . . . ,C) , and 
Xj is the corresponding explanatory variable value. One can 
construct an estimate of the density function given that E ^ has 
occurred, f(Xj|Ej), for each class, j = 1,2,...,C; the histo- 
gram, raw or smoothed (c.f. Wegman (1979)), or a simple model 
(normal, lognormal, etc.) may do well. By Bayes' theorem, then 



f (x|E.)P(E.) 

P{E_.|x} = ^ 3 

l f(x|E.)P{E.} 
j = l 11 => 



here P{Ej} is the overall marginal probability of an event in 
class j, estimable from history. Such an approach is under 
empirical investigation by the author, utilizing some meteoro- 
logical data. 
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5. Time Series 



Many of the data sets encountered in studies of the natural 
environment have distinctive time-series structure. This means 
that successive observations in time (and contiguous observations 
in space) are likely to be rather similar or correlated as a 
result of both (a) important natural phenomena associate system- 
atically with seasons, years, geographical places, and forces of 
nature, and also (b) what may be considered to be the haphazard 
superposition of a variety of additional effects, among which may 
be measurement error. These latter effects typically do not 
appear independent from observation to observation, a fact that 
represents both opportunity and difficulty in statistical analysis. 
In the present section we review a few basic notions and concepts 
in time series analysis. The topic is subtle, and deserves more 
than it gets in this discussion. 

5.1. Components of a Time Series 

One way of thinking about a time series, say of monthly 
total precipitation at a point in space, or visibility at a point, 
is in terms of the simple components 
o (apparent) trend 



■ systematic effect 



o (apparent) seasonal effect 

o (random) disturbance or noise. 

The trend ideally represents a systematic long-term change; 
"apparent" is appended because a steady trend may actually be 
quite impermanent, giving way to a new and different trend. Think 
of the economy, particularly the stock market indices. Mean sea 
levels measured by tide gauges are also of this nature, possibly 
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because a regime of slow, regular, changes in land supporting a 
tide gauge may rather suddenly be supplanted by a different 
regime, perhaps the result of human activity. Sometimes a trend 
may be taken to be linear, at least provisionally. The seasonal 
effect often seems inevitable: ordinarily it rains in the winter 

in Monterey, and not in the summer. Tides behave in accordance 
with time of day, with a slow trend superimposed. Left over is 
the contribution of random disturbance or noise, which represents 
the joint influence of other conditions, including measurement 
error. Interestingly, the latter random component has received 
more sophisticated mathematical-statistical attention than have 
the other components. Understanding trends and seasonals 
(apparently systematic effects) probably depends upon understand- 
ing the underlying subject-matter area. 

5.2. Decomposition 

Investigation of a time series z^., t = 1,2,... 
ought to begin by a study of graphical display. The inter-rela- 
tions between several series is often suggested by such an approach. 
Suppose one wishes to isolate the semi-permanent component of the 
series (trend and seasonal) . This can be approached in the 
following ways: 

o Fit a specific function to the trend, 

o Smooth , either non-robustly or robustly. 

If a plot shows a nearly linear trend (perhaps after transforma- 
tion, e.g. by logs) one can fit 

z t vs a + bt t = 1,2,... ,T 
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by least squares. Now if the true situation were well-represented 
by the model 



Z = a + bt + e t 

with {e fc , t=l,2,...} a sequence of non-independent random 
variables with, nevertheless E[£ t ] = 0, var[£^_] = o 2 , the 

Ordinary Least Squares is less than perfectly efficient. However 
it is ordinarily a useful approach — maybe all that is available. 

A A 

Once a LS » b LS are available one can subtract away the fit, 
leaving the residuals 



" Z t " a LS " b LS t 



for study. If the trend removal has been effective, and no inter- 
vening seasonal has occurred, the r t 's will have roughly constant 
variation for all t (box plot segments of the series for diagnosis) , 
and a characterization or study of the r t ' s is i n order. Note 
that if the covariance function 



f (t) 




T-T 



l 

t=l 



r 



t+T 



remains and diminishes only slowly as t increases this signifies 
substantial similarities between observations separated in time 
contributed by something (here it is assumed to be the noise com- 
ponent, although it might well be an un-removed contribution by 
seasonal, or a change in trend) . Often f(x) looks something 
like this 
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5.3. Noise Models 

It is convenient to employ simple parametric models 
to describe noise behavior. Here are several 



Simple Markov or Autoregressive: = P £ -t--l + a t' 

{a t , t = l,2,...} is "white noise," a sequence of 

independently Gaussian variables. 

*t + a t-l + * * * a t-p 



Simple Moving Average: = 

p an integer > 0. 



P+1 



o Autoregressive-Moving Average (ARMA (1,1)): 

e t - pe t-i = a t - 0a t-i 



It is well to start with the simple Markov, for which one may 
estimate p from the residuals by 



P 



1 

T-l 



T 



l 

t=l 



r t r 



t-l 



The theoretical autocovariance function of the Markov (AR) is 
actually 

f T = a 2 pl T ^ | t | = 0,1,2, . . . 

so as a check (p) T should die off exponentially. A plot on 
semi-log paper is helpful for a check. 
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Although least squares may provide reasonable estimates 
for parameters a and b, the attempt to use the sum of the 
squares of the residuals to estimate a 2 (noise variance) , and 
then to use this estimate in conventional formulas for standard 
errors and confidence limits for a and b is often a bad 
mistake: the "standard" results, such as those found in regres- 

sion packages at computer centers, are likely to give wildly 
optimistic (falsely precise) results. Such effects have been 
noticed by Abreu (1980) in an investigation of sea level trends 
on the U.S. Pacific Coast. 

Note that i£ a linear trend has been fitted by OLS , 
and the residuals appear to be AR(1) , then the variance of the 

A 

slope term estimate, ^OLS' a PP rox; *- ma tely that given by the 

OLS formulas appropriate for purely random noise, multiplied by 




For more detail see Bloomfield (1980) . 



APPENDIX 



The purpose of this appendix is to report the results of 
simulation tests of the performance of Winsorized t confidence 
limits as compared to ordinary Student's t. In particular, 
comparative performance is reported, as measured by (i) confidence 
interval coverage (fraction of intervals actually containing the 
true value of the mean) , and (ii) confidence interval average 
(estimate of expected) width. 

o Simulated Data . The "data" were obtained as follows. 
Alternatively, 

(a) Samples of size n = 20 from Normal (0,1) . The 
random variable of which the observations are 
independent instances will be called Z. 

(b) Samples of size n = 20 from the distribution of 

hZ 2 

Y = Z e , h > 0. The y-values are more spread 
out (but still symmetrically so) than are the basic 
z's, in order to represent long, fat, tails possibly 
resulting from outliers. The above convenient form 
has been suggested by Tukey. 

(c) Samples of size n = 20 from the distribution of 
W = (e g Z -l)/g', the latter being recognizable as 
the asymmetric log-Normal form. 

o Experimental Sampling . Suppose that a sample is available 
(from one of the above situations, but the exact source being 
unknown) , compute confidence limits for the mean of the parent 
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distribution, using (A) ordinary Student's t, and (B) Winsorized 
t, using g = 2, i.e. the lowest two values were given the value 
of x ( 3 )' and the highest two values the value of x ( 1 8 ) * Do so 
repeatedly (k = 1000 times at present) and compare for (i) cover- 
age to nominal (here 95% two-sided) and (ii) average width of 
confidence intervals. These parameter values were examined: 



g- : 


= 0. 


2, h 


= 1,00, 


and h = 2.00 




o Results 


of 


the Sampling Experiment. 




Case 


SLl 


h 


Method 


Coverage ( %) 


Av. Width 


Normal (Z) 


0 


0 


Student 


95.5 


0.92 


Long-Tail (Y) 


0 


1.00 


Student 


99.3 


8658 


Skewed (W) 


0.2 


0 


Student 


94.8 


0.95 


Long-Tail ( Y) 


0 


1.00 


Winsor (2 ) 


96.4 


8.46 


Skewed (W) 


0.2 


0 


Winsor (2) 


95.3 


0.98 


Long-Tail (Y) 


0 


2.00 


Student 


99.8 


4 . 3xl0 10 


Long-Tail (W) 


0 


2.00 


Winsor (2 ) 


98.0 


548 



o Conclusions . Obvious indication of the efficacy of Winsor- 
izing long-tailed (Y) observations as compared to traditional 
student's t. At that, the level of Winsorization (g=2) is 
probably too small, and a larger value would provide valid 
narrower, intervals. Alternatively, utilize biweights. 
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